8/6/2020

Pollen is a unique paleoclimate proxy

Most analysis are inherently non-spatial

  • Due to challenging data likelihood.

  • Extremely noisy relationship between data and process.


    • No obvious patterns when plotting the data.

    • Inverse prediction problem.

Modeling goal

  • Link the observed pollen counts to climate states during the modern period.

  • Use the learned relationship to predict unobserved climate state.

  • Generate climate histories that are local to the site of interest with uncertainty.

Setting the stage

  • Spatially explicit reconstructions of climate variables is important.


    • Many important ecological questions are local.

    • Predictions at all locations remove the effects of sampling bias in paleoclimate reconstruction.

  • Prior work – 4 sites and total compute time of approximately 28 hours.


    • Currently: 363 sites with total compute time less than 2 hours.

Data model

  • Sediment samples from a lake.

  • Take 1cm\(^3\) cubes along the length of the sediment core.

  • In each cube, researcher counts the first \(M\) pollen grains and identifies to species/OTU.

  • Raw data are counts of each species.

  • Compositional count data.

Data model

\[\begin{align*} \mathbf{y}(\mathbf{s}, t) & \sim \operatorname{Multinomial}(M(\mathbf{s}, t), \pi_{SB}(\boldsymbol{\eta}(\mathbf{s}, t))) \end{align*}\]


  • \(\mathbf{y}(\mathbf{s}, t)\) is a \(J\) dimensional vector of counts at site \(\mathbf{s} \in \mathcal{D}\) and time \(t \in \{1, \ldots, n_t\}\)

  • \(M(\mathbf{s}, t) = \sum_{j=1}^J \mathbf{y}_j(\mathbf{s}, t)\) is the total count observed.

  • Link the underlying climate states to the probability of observing species \(j\) through the latent variable \(\eta_j(\mathbf{s}, t)\)

Data model

  • \(\pi_{SB}(\boldsymbol{\eta}(\mathbf{s}, t))\) is a stick breaking transformation.


    • The map \(\pi_{SB}: \mathcal{R}^{J-1} \rightarrow \Delta^{J}\) transforms the \(J-1\) dimensional vector \(\boldsymbol{\eta}(\mathbf{s}, t)\) to the \(J\) dimensional unit simpled \(\mathcal{\Delta}^{J}\).

  • Other maps to the unit simplex could be used (i.e., multi-logit), but the stick-breaking map will be important later.

Data model (ignoring spatio-temporal indexing)

\[\begin{align*} [\mathbf{y} | M, \pi_{SB}(\boldsymbol{\eta})] & =\operatorname{Multinomial}(\mathbf{y} | M, \pi_{SB}(\boldsymbol{\eta})) \\ & = \prod_{j=1}^{J-1} \operatorname{binomial}(y_j | M_j, \tilde{\pi}_j) \\ & = \prod_{j=}^{J-1} {M_j \choose y_j} \frac{(e^{\eta_j})^{y_j}}{(1 + e^{\eta_j})^{M_j} } \end{align*}\]

  • \(M_j = M - \sum_{k < j} M_k\).

  • Define the partial probabilities \(\tilde{\pi}_j = \pi_{SB}(\boldsymbol{\eta})_j\) using the stick-breaking representation.

Data model (ignoring spatio-temporal indexing)

\[\begin{align*} \frac{(e^{\eta_j})^{y_j}}{(1 + e^{\eta_j})^{M_j} } & = 2^{-M_j} e^{\kappa(y_j) \eta_j} \int_0^\infty e^{- \omega_j \eta_j^2 / 2 } [\omega_j | M_j, 0] \,d \omega \end{align*}\]


  • The integral identity is proportional to the product representation of the multinomial distribution

  • The density \([\omega_j | M_j, 0]\) is a Pólya-gamma distribution \(PG(b, c)\)

Data model (ignoring spatio-temporal indexing)

  • With a prior \([\eta_j]\), joint density \([y_j, \eta_j]\) is

\[\begin{align*} [\eta_j, y_j] & = [\eta_j] {M_j \choose y_j} \frac{(e^{\eta_j})^{y_j}}{(1 + e^{\eta_j})^{M_j} }\\ & = \int_0^\infty [\eta_j] {M_j \choose y_j} 2^{-M_j} e^{\kappa(y_j) \eta_j} e^{- \omega_j \eta_j^2 / 2 } [\omega_j | M_j, 0] \,d \omega \end{align*}\]


where the integral defines a joint density over \([\eta_j, y_j, \omega_j]\).


Data model

  • Using this integral representation, we have

    \[\begin{align*} \omega_j | \eta_j, y_j & \sim \operatorname{PG( M_j, \eta_j)} \end{align*}\]

    which can be sampled using the exponential tilting property of the Pólya-gamma distribution


  • If \([\eta_j]\) is Gaussian, then \([\eta_j | \omega_j, y_j]\) is also Gaussian which enables conjugate sampling.

Data model

  • There is a cost:


    • Requires sampling the \(\omega_j\)s and these are computationally expensive.

    • However, these are also embarrassingly parallel.

    • Efficiently sampled using openMP parallelization.

  • For all examples tested so far with reduced-rank spatial processes, sampling Pólya-gamma random variables is the limiting computational cost.

Process model – functional relationship

\[\begin{align*} \eta_j(\mathbf{s}, t) = \color{blue}{\beta_{0j}} + \color{blue}{\beta_{1j}} \left( \mathbf{x}'(\mathbf{s}, t) \boldsymbol{\gamma} + \mathbf{w}'(\mathbf{s}) \boldsymbol{\alpha}(t) \right) + \color{blue}{\varepsilon(\mathbf{s}, t)} \end{align*}\]


  • \(\color{blue}{\beta_{0j}}\) and \(\color{blue}{\beta_{1j}}\) are regression coefficients with respect to the climate state \(\mathbf{Z}(\mathbf{s}, t) = \left( \mathbf{x}'(\mathbf{s}, t) \boldsymbol{\gamma} + \mathbf{w}'(\mathbf{s}) \boldsymbol{\alpha}(t) \right)\).

  • \(\color{blue}{\varepsilon(\mathbf{s}, t)} \stackrel{iid}{\sim} \operatorname{N}(0, \sigma^2_j)\) models overdispersion relative to the linear regression response.

Process model – climate process

\[\begin{align*} \eta_j(\mathbf{s}, t) = \beta_{0j} + \beta_{1j} \left( \color{red}{\mathbf{x}'(\mathbf{s}, t) \boldsymbol{\gamma}} + \color{purple}{\mathbf{w}'(\mathbf{s}) \boldsymbol{\alpha}(t)} \right) + \varepsilon(\mathbf{s}, t) \end{align*}\]

\(\color{red}{\mathbf{X}(t) = \begin{pmatrix} \mathbf{x}'(\mathbf{s}_1, t) \\ \vdots \\ \mathbf{x}'(\mathbf{s}_n, t) \end{pmatrix}}\) are fixed covariates (elevation, latitude, etc.).

  • We assume \(\color{red}{\mathbf{X}(t) \equiv \mathbf{X}}\) for all \(t\) although temporally varying covariates are possible (volcanic forcings, Milankovitch cycles, etc.).

  • \(\color{purple}{\mathbf{W} = \begin{pmatrix} \mathbf{w}'(\mathbf{s}_1) \\ \vdots \\ \mathbf{w}'(\mathbf{s}_n) \end{pmatrix}}\) are spatial basis functions with temporal random effects \(\color{purple}{\boldsymbol{\alpha}(t)}\).

  • \(\mathbf{Z}_0 \sim \operatorname{N} (\color{red}{\mathbf{X}'(1) \boldsymbol{\gamma}} + \color{purple}{\mathbf{W} \boldsymbol{\alpha}(1)}, \sigma^2_0 \mathbf{I})\) is the observed modern climate state.

Process model – \(\color{purple}{\mbox{random effects}}\)

  • Challenge: scaling to 1000s of sites and 15,000 years (\(\approx 60\) time increments)

  • Sparse, multiresolution representation with a dynamic linear model

\[\begin{align*} \color{purple}{\mathbf{w}'(\mathbf{s}) \boldsymbol{\alpha}(t)} = \color{purple}{\sum_{m=1}^M \mathbf{w}_m'(\mathbf{s}) \boldsymbol{\alpha}_m(t)} \end{align*}\]


  • \(\color{purple}{\mathbf{w}_m'(\mathbf{s})}\) is a Wendland basis over resolution \(m\)

Process model – \(\color{purple}{\mbox{random effects}}\)



Process model – \(\color{purple}{\mbox{random effects}}\)

\[\begin{align*} \color{purple}{\boldsymbol{\alpha}_m(t)} & \sim \operatorname{N} \left( \mathbf{A}_m \color{purple}{\boldsymbol{\alpha}_m(t-1)}, \tau^2 \mathbf{Q}_m^{-1} \right) \end{align*}\]


  • Assume the constraint \(\color{purple}{\boldsymbol{\alpha}'_m(t)} \mathbf{1} = 0\) for all \(m\) and \(t\)

  • \(\mathbf{A}_m\) is the evolution matrix (assume \(\mathbf{A}_m \equiv \rho \mathbf{I}\) for all \(m\))

  • \(\mathbf{Q}_m\) is the precision matrix constructed from either a CAR or SAR process over the \(m\)th resolution

Next steps

R packages

References

Holmström, Lasse, Liisa Ilvonen, Heikki Seppä, and Siim Veski. 2015. “A Bayesian Spatiotemporal Model for Reconstructing Climate from Multiple Pollen Records.” The Annals of Applied Statistics 9 (3): 1194–1225.

Linderman, Scott, Matthew Johnson, and Ryan P Adams. 2015. “Dependent Multinomial Models Made Easy: Stick-Breaking with the Pólya-Gamma Augmentation.” In Advances in Neural Information Processing Systems, 3456–64.

Nychka, Douglas, Soutir Bandyopadhyay, Dorit Hammerling, Finn Lindgren, and Stephan Sain. 2015. “A Multiresolution Gaussian Process Model for the Analysis of Large Spatial Datasets.” Journal of Computational and Graphical Statistics 24 (2): 579–99.

Polson, Nicholas G, James G Scott, and Jesse Windle. 2013. “Bayesian Inference for Logistic Models Using Pólya–Gamma Latent Variables.” Journal of the American Statistical Association 108 (504): 1339–49.

Tipton, John R, Mevin B Hooten, Connor Nolan, Robert K Booth, and Jason McLachlan. 2019. “Predicting Paleoclimate from Compositional Data Using Multivariate Gaussian Process Inverse Prediction.” The Annals of Applied Statistics 13 (4): 2363–88.